A Quick Guide to Benchmarking AI Models on Azure: ResNet with MLPerf Training v3.0
Published Jun 28 2023 04:19 PM
Microsoft

By Hugo Affaticati (Technical Program Manager), Sonal Doomra (Technical Program Manager 2), and Jon Shelley (Principal TPM Manager).

 

Introduction

Azure is pleased to showcase results from our MLPerf Training v3.0 submission. For this submission, we benchmarked our ND H100 v5 virtual machine (preview), which features:

  • 8x NVIDIA H100 Tensor Core GPUs interconnected via next-generation NVSwitch and NVLink 4.0
  • 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU, for 3.2 Tb/s per VM, in a non-blocking fat-tree network
  • NVSwitch and NVLink 4.0 with 3.6 TB/s bisection bandwidth between the 8 local GPUs within each VM
  • 4th Gen Intel Xeon Scalable processors
  • PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
  • 16 channels of 4800 MHz DDR5 DIMMs

Full results are available on the MLCommons website.

 

How to replicate the results in Azure

 

Prerequisites:

Deploy and set up an ND H100 v5 virtual machine on Azure using Azure Portal or Azure CycleCloud.
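Once the VM is up, it can be worth confirming that all 8 GPUs listed in the introduction are visible before going further. The snippet below is a minimal sketch, assuming the NVIDIA driver is already installed on the VM image; the helper name count_gpu_lines is hypothetical and only counts the lines printed by nvidia-smi -L.

```shell
# Counts GPU entries in the output of `nvidia-smi -L`, read from stdin.
# `nvidia-smi -L` prints one line per GPU, starting with "GPU <index>:".
count_gpu_lines() {
    grep -c '^GPU '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # 8 is the GPU count stated for ND H100 v5 in this guide.
    echo "GPUs visible: $(nvidia-smi -L | count_gpu_lines) (expected 8)"
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
fi
```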

 

Set up the environment

First, download the container from NVIDIA NGC (an NGC account is required). Then clone the MLCommons GitHub repository of MLPerf Training v3.0 results, which is publicly available and contains the Azure-specific code.

cd /share
docker pull nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet
git clone https://github.com/mlcommons/training_results_v3.0.git
cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
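Before moving on, a quick check that the clone contains the files this guide relies on can save a failed run later. This is a sketch only: the helper name missing_files is hypothetical, and the two file names are simply the ones used in the run step of this guide.

```shell
# Prints the name of every expected file that is absent from a directory.
# $1 = directory, remaining arguments = file names expected inside it.
missing_files() {
    local dir="$1" f
    shift
    for f in "$@"; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

IMPL_DIR=/share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
if [ -d "$IMPL_DIR" ]; then
    missing=$(missing_files "$IMPL_DIR" config_DGXH100.sh run_with_docker.sh)
    if [ -z "$missing" ]; then
        echo "All expected files are present in $IMPL_DIR"
    else
        echo "Missing from $IMPL_DIR: $missing" >&2
    fi
fi
```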

 

Get the dataset for ResNet

ResNet uses the ImageNet 2012 (ILSVRC2012) dataset. MLPerf Training v3.0 requires both the Training images (Task 1 & 2) and the Validation images (all tasks).

For the Training images:

mkdir /share/data && cd /share/data
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read -r NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..

For the Validation images:

wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val
tar -xvf ILSVRC2012_img_val.tar && rm -f ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
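After both downloads finish, the extracted data can be sanity-checked against the published ImageNet 2012 sizes: 1,000 classes with 1,281,167 training images and 50,000 validation images. The following is a minimal sketch, assuming the /share/data/train and /share/data/val layout produced by the commands above, with one sub-directory per class and files ending in .JPEG; the function names are hypothetical.

```shell
# Prints "<class_dirs> <image_files>" for one ImageNet split directory.
imagenet_split_counts() {
    local dir="$1" classes images
    classes=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l)
    images=$(find "$dir" -mindepth 2 -maxdepth 2 -type f -name '*.JPEG' | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "$((classes)) $((images))"
}

# Compares a split against expected counts; returns non-zero on mismatch.
# $1 = split directory, $2 = expected classes, $3 = expected images.
check_imagenet_split() {
    local counts
    counts=$(imagenet_split_counts "$1")
    if [ "$counts" = "$2 $3" ]; then
        echo "OK: $1 ($counts)"
    else
        echo "MISMATCH: $1 has '$counts', expected '$2 $3'" >&2
        return 1
    fi
}

if [ -d /share/data/train ]; then
    check_imagenet_split /share/data/train 1000 1281167
fi
if [ -d /share/data/val ]; then
    check_imagenet_split /share/data/val 1000 50000
fi
```

A mismatch usually points to an interrupted download or extraction, in which case the affected split is best re-extracted before starting the benchmark.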

 

Run the ResNet benchmark

Running the benchmark takes two steps: source the configuration file, then start the benchmark.

cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
source config_DGXH100.sh
CONT=nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet DATADIR=/share/data LOGDIR=results ./run_with_docker.sh
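When the run completes, the logs written under LOGDIR can be used to estimate the time to train. The sketch below assumes the standard MLPerf logging format, in which each event is a line containing ":::MLLOG" followed by JSON with "time_ms" and "key" fields, and takes the difference between the run_stop and run_start timestamps. The function names are hypothetical, and the exact log file names under results depend on the run.

```shell
# Prints the time_ms of the first MLLOG event with the given key.
# $1 = log file, $2 = event key (for example run_start or run_stop).
mllog_time_ms() {
    grep ':::MLLOG' "$1" | grep "\"key\": \"$2\"" | head -n 1 \
        | sed -E 's/.*"time_ms": ([0-9]+).*/\1/'
}

# Prints run_stop minus run_start in whole seconds; fails if either is missing.
time_to_train_seconds() {
    local start stop
    start=$(mllog_time_ms "$1" run_start)
    stop=$(mllog_time_ms "$1" run_stop)
    if [ -z "$start" ] || [ -z "$stop" ]; then
        echo "run_start or run_stop not found in $1" >&2
        return 1
    fi
    echo $(( (stop - start) / 1000 ))
}

# Usage (replace the placeholder with an actual log file from the run):
#   time_to_train_seconds results/<log file>
```

This gives a quick local estimate only; official results go through the MLCommons compliance and result-summarizer tooling.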

The same steps apply to the other MLPerf Training v3.0 benchmarks, using each benchmark's own configuration file and data preprocessing steps.

 


Last update: Nov 08 2023 06:05 PM